Abstract

The chain-rule processor is a method of constructing an
optimal Bayes classifier from a bank of processors. Each
processor is a feature extractor designed to separate the
given class from a class-dependent reference hypothesis,
thereby avoiding the curse of dimensionality. This work
builds upon prior work in optimal classifier design using
class-specific features. The chain-rule processor is an
improvement that recursively applies the PDF projection
theorem.


1  Introduction 

The so-called M-ary classification problem is that of
assigning a multidimensional sample of data $\mathbf{x} \in \mathbb{R}^N$ to one
of M classes. The statistical hypothesis that class j is true
is denoted by $H_j$, $1 \le j \le M$. The statistical characterization
of $\mathbf{x}$ under each of the M hypotheses is described completely
by the probability density functions (PDFs), written
$p(\mathbf{x}|H_j)$, $1 \le j \le M$. Unfortunately, the PDFs are often
unknown and must be approximated from training data.

The high dimension of the raw data usually precludes
estimating the PDFs without knowing the parametric form
beforehand. Because the classical theory does not offer any
solutions to this problem, many practitioners are forced to
extract a set of information-bearing features of lower
dimension from the raw data, then abandon the raw data
altogether by recasting the problem as though the features were
the raw data. We contend that in problems of high
complexity, this drastic step is unnecessary. In such problems,
where there are many statistical hypotheses, the
dimension of the feature space that has enough information
to optimally separate all the classes grows with the number
of classes. But with a fixed amount of training data, the
PDF of the features can be difficult to estimate above a very
small dimension (usually about 5). Therefore, classifier
performance can be severely limited either by insufficient
information in a low-dimensional feature set, or by
unreliable PDF estimation at high dimensions. The root of
the problem is that we seek a common feature space where
all classes are separable. Any feature that has information
pertaining to just one or two classes must be thrown into
the common feature set regardless of whether it is
informative about any other classes. The added feature dimensions
are often completely irrelevant to some of the classes, yet
these dimensions must be estimated under all hypotheses,
and this can lead to very poor PDF estimation. Fortunately,
there is a theoretical solution to this dilemma, which we
now present.

2  The  PDF  Projection  Theorem 

In  this  section,  we  describe  the  theorem  that  makes  the 
class-specific  method  possible.  In  Section  3,  we  use  the 
theorem  to  construct  a  classifier. 

2.1  Summary  of  the  Theorem 

For those readers who would like to skip most of this
section, we summarize the main result as follows. The PDF
projection has nothing to do with linear or nonlinear
projection from high dimensions to low dimensions. Instead,
it has to do with estimating the PDF of a select set of
features (only the relevant features necessary to characterize
a given class), then projecting the PDF of the
low-dimensional features back to the high-dimensional raw data
space. That is, we construct a function of the raw data from
the feature PDF that is not only a PDF (it integrates to 1 on
the raw data space), but is also a good approximation to the
desired class PDF. This completely eliminates the need to
construct decision boundaries in a feature space and makes
possible direct implementation of the optimal Bayes
classifier on the raw data space. Let $H_i$ be any data class
hypothesis. Let $\mathbf{z}_i = T_i(\mathbf{x})$ be a set of features that describes class
$H_i$, such as estimates of the relevant parameters
that govern the data for that class. Then, the projection
operation is written


\[ \hat{p}_x(\mathbf{x}|H_i) = Q(\mathbf{x}, T_i, H_0)\; \hat{p}_z(\mathbf{z}_i|H_i) \tag{1} \]

where $\hat{p}_z(\mathbf{z}_i|H_i)$ is the estimated low-dimensional feature
PDF of $\mathbf{z}_i$ and $\hat{p}_x(\mathbf{x}|H_i)$ is the projected PDF defined on
the raw input data space (such as raw time-series or raw
image pixel data), not on features. The projection or “Q” function

\[ Q(\mathbf{x}, T_i, H_0) = \frac{p(\mathbf{x}|H_0)}{p(\mathbf{z}_i|H_0)} \tag{2} \]

is the ratio of the PDF at the input and output of the
feature transformation under some reference hypothesis $H_0$. It
is determined exactly from the feature transformation and
does not depend on training data, and therefore does not suffer
from dimensionality issues. The method requires that the feature
transformations are known and needs access to the raw data
in order to compute equation (2). Thus, it cannot be viewed
as another method of constructing decision functions on a
given feature space.


2.2 Statement of Theorem

It is well known how to write the PDF of $\mathbf{x}$ from the PDF
of $\mathbf{z}$ when the transformation is 1:1. This is the change of
variables theorem from basic probability. Let $\mathbf{z} = T(\mathbf{x})$,
where $T(\mathbf{x})$ is an invertible and differentiable
multidimensional transformation. Then,

\[ p_x(\mathbf{x}) = |J(\mathbf{x})|\; p_z(T(\mathbf{x})), \tag{3} \]

where $|J(\mathbf{x})|$ is the determinant of the Jacobian matrix of
the transformation

\[ J_{ij} = \frac{\partial z_i}{\partial x_j}. \]

What we seek is a generalization of (3) which is valid for
many-to-1 transformations. Define

\[ \mathcal{F}(T, f_z) = \{ f_x(\mathbf{x}) : \mathbf{z} = T(\mathbf{x}) \text{ and } \mathbf{z} \sim f_z(\mathbf{z}) \}, \]

that is, $\mathcal{F}(T, f_z)$ is the set of PDFs $f_x(\mathbf{x})$ such that if
$\mathbf{z} = T(\mathbf{x})$, then $\mathbf{z}$ has PDF $f_z(\mathbf{z})$. If $T(\cdot)$ is many-to-one,
$\mathcal{F}(T, f_z)$ will contain an infinite number of members.
Therefore, it is impossible to uniquely determine $f_x(\mathbf{x})$
from $T(\cdot)$ and $f_z(\mathbf{z})$. We can, however, find a particular
solution if we constrain $f_x(\mathbf{x})$. The applicable constraint is
that

\[ \frac{f_x(\mathbf{x})}{p_x(\mathbf{x}|H_0)} = \frac{f_z(\mathbf{z})}{p_z(\mathbf{z}|H_0)}, \tag{4} \]

or that the likelihood ratio (with respect to $H_0$) is the same
in both the raw data and feature domains for some
pre-determined reference hypothesis $H_0$. We will soon show
that this constraint produces desirable properties. The
particular form of $f_x(\mathbf{x})$ is uniquely defined by the constraint
itself, namely

\[ f_x(\mathbf{x}) = \frac{p_x(\mathbf{x}|H_0)}{p_z(\mathbf{z}|H_0)}\; f_z(\mathbf{z}); \quad \text{at } \mathbf{z} = T(\mathbf{x}). \tag{5} \]

Theorem 1 proves that (5) is, indeed, a PDF.

Theorem 1 (PDF Projection Theorem). Let $H_0$ be some
fixed reference hypothesis with known PDF $p_x(\mathbf{x}|H_0)$. Let
$\mathcal{X}$ be the region of support of $p_x(\mathbf{x}|H_0)$. In other words, $\mathcal{X}$
is the set of all points $\mathbf{x}$ where $p_x(\mathbf{x}|H_0) > 0$. Let $\mathbf{z} = T(\mathbf{x})$
be a continuous many-to-one transformation (the continuity
requirement may be overly restrictive). Let $\mathcal{Z}$ be the image
of $\mathcal{X}$ under the transformation $T(\mathbf{x})$. Let $p_z(\mathbf{z}|H_0)$ be the
PDF of $\mathbf{z}$ when $\mathbf{x}$ is drawn from $p_x(\mathbf{x}|H_0)$. It follows that
$p_z(\mathbf{z}|H_0) > 0$ for all $\mathbf{z} \in \mathcal{Z}$. Now, let $f_z(\mathbf{z})$ be any other
PDF with the same region of support $\mathcal{Z}$. Then the function
(5) is a PDF on $\mathcal{X}$, thus

\[ \int_{\mathbf{x} \in \mathcal{X}} f_x(\mathbf{x})\, d\mathbf{x} = 1. \]

Furthermore, $f_x(\mathbf{x})$ is a member of $\mathcal{F}(T, f_z)$.

Proof: These assertions are proved in prior publications
[1],[2].
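As a numerical illustration of Theorem 1, the following minimal sketch checks that the projected function (5) integrates to 1 in a toy setting. Everything specific here is an assumption chosen for illustration and does not come from the paper: the raw data are two samples, the reference hypothesis H0 is unit-variance white Gaussian noise, the feature is the total energy (so z is chi-square with 2 degrees of freedom under H0), and the feature PDF f_z is an exponential density with an arbitrary mean.

```python
import math

def log_px_h0(x1, x2):
    """Log-PDF of the raw data under H0: two independent N(0,1) samples."""
    return -math.log(2.0 * math.pi) - 0.5 * (x1 * x1 + x2 * x2)

def feature(x1, x2):
    """Many-to-one feature transformation z = T(x): total energy."""
    return x1 * x1 + x2 * x2

def log_pz_h0(z):
    """Log-PDF of z under H0: chi-square with 2 degrees of freedom."""
    return -math.log(2.0) - 0.5 * z

def log_fz(z, mean=5.0):
    """An arbitrary feature PDF on z > 0 (exponential; illustrative choice)."""
    return -math.log(mean) - z / mean

def log_fx(x1, x2):
    """Projected PDF (5): f_x(x) = p_x(x|H0) / p_z(z|H0) * f_z(z), z = T(x)."""
    z = feature(x1, x2)
    return log_px_h0(x1, x2) - log_pz_h0(z) + log_fz(z)

def integrate_fx(limit=12.0, step=0.05):
    """Midpoint-rule integral of f_x over the square [-limit, limit]^2."""
    n = int(round(2.0 * limit / step))
    total = 0.0
    for i in range(n):
        x1 = -limit + (i + 0.5) * step
        for j in range(n):
            x2 = -limit + (j + 0.5) * step
            total += math.exp(log_fx(x1, x2))
    return total * step * step

if __name__ == "__main__":
    # Should be very close to 1 if (5) is a valid PDF on the raw data space.
    print(integrate_fx())
```

Working in the log domain mirrors how the Q-function would typically be evaluated in practice, since the numerator and denominator of (2) can both be extremely small in the tails.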

2.3  Optimality  Conditions 

While it is interesting that $f_x(\mathbf{x})$ is a PDF, it is not yet
clear that $f_x(\mathbf{x})$ is the best choice. We would generally
like to use $f_x(\mathbf{x})$ as an approximation to the PDF $p_x(\mathbf{x}|H_1)$.
Define

\[ \hat{p}_x(\mathbf{x}|H_1) = \frac{p_x(\mathbf{x}|H_0)}{p_z(\mathbf{z}|H_0)}\; \hat{p}_z(\mathbf{z}|H_1) \quad \text{at } \mathbf{z} = T(\mathbf{x}). \tag{6} \]

From Theorem 1, we see that (6) is a PDF. Furthermore,
if $T(\mathbf{x})$ is a sufficient statistic for $H_1$ vs $H_0$, then as
$\hat{p}_z(\mathbf{z}|H_1) \to p_z(\mathbf{z}|H_1)$, we have

\[ \hat{p}_x(\mathbf{x}|H_1) \to p_x(\mathbf{x}|H_1). \]

This situation produces an optimal classifier if it can be
achieved for all classes. If the reader can excuse the
misuse of the terms optimal and sufficient: if sufficiency is
approximate, it produces a classifier that is approximately
optimal. This is because the PDF projection operator produces
a valid PDF regardless of sufficiency.
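The role of sufficiency can be made explicit with one extra step. The short derivation below assumes the standard likelihood-ratio characterization of a statistic that is sufficient for a binary test; that characterization is supplied here for illustration and is not spelled out in the text.

```latex
\begin{align*}
% Sufficiency of z = T(x) for H_1 vs H_0: the likelihood ratio is
% the same in the raw data domain and in the feature domain.
\frac{p_x(\mathbf{x}|H_1)}{p_x(\mathbf{x}|H_0)}
  &= \frac{p_z(\mathbf{z}|H_1)}{p_z(\mathbf{z}|H_0)},
  \qquad \mathbf{z} = T(\mathbf{x}), \\
% Solving for the class PDF recovers the form of (6) with the true
% feature PDF in place of its estimate.
\Longrightarrow \quad
p_x(\mathbf{x}|H_1)
  &= \frac{p_x(\mathbf{x}|H_0)}{p_z(\mathbf{z}|H_0)}\; p_z(\mathbf{z}|H_1).
\end{align*}
```

Under this assumption, the only error in the projected PDF comes from the estimate of the low-dimensional feature PDF, which is why the approximation becomes exact as that estimate converges.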

Theorem 1 allows maximum likelihood (ML) methods
to be used in the raw data space to optimize the accuracy of
the approximation. Let $\hat{p}_z(\mathbf{z}|H_1)$ be parameterized by the
parameter $\theta$. Then, the maximization

\[ \max_{\theta,\, T,\, H_0} \left\{ \frac{p_x(\mathbf{x}|H_0)}{p_z(\mathbf{z}|H_0)}\; \hat{p}_z(\mathbf{z}|H_1; \theta) \right\} \]

is a valid ML approach and can be used for model selection
(with appropriate data cross-validation).


2.4 Reference Hypotheses and Calculation of the “Q” Function

One of the issues in constructing the “Q” function (2)
is that it must be possible to calculate both the numerator
and denominator for any value of $\mathbf{x}$ or $\mathbf{z}_i$. Problems
arise especially in the PDF tails. For many feature
transformations, where the numerator and denominator PDFs may
be derived analytically, tail behavior is not a problem. For
still more feature transformations, the PDFs may be
accurately analyzed in the tails using the saddlepoint
approximation [3] or analyzed using asymptotic Cramer-Rao (CR)
bound analysis. Subject to certain requirements, the reference
hypothesis $H_0$ can vary on-the-fly to match the input data sample
as well as possible. This helps produce simpler and
better-conditioned expressions for the numerator and denominator
of the Q-function and allows reduction of feature
dimension. A special case of variable reference hypothesis is
when the features are maximum likelihood (ML) estimates,
in which case the Q-function can be derived from the
Cramer-Rao lower bound. For more information on these issues,
the reader is referred to the website [6].

2.5 The Chain Rule

In many cases, it is difficult to derive the ratio
$p(\mathbf{x}|H_0)/p(\mathbf{z}_i|H_0)$ for an entire processing chain. On the
other hand, it may be quite easy to do it for one stage of
processing at a time. This is especially true because
up-front processing (FFT, wavelets, etc.) is usually simpler and
yields well to analysis. Another advantage of breaking the
processing into stages is to take advantage of redundancy in
the up-stream processing. The chain rule is just the
recursive application of the PDF projection theorem. Consider a
processing chain:

\[ \mathbf{x} \xrightarrow{\;T_1(\mathbf{x})\;} \mathbf{y} \xrightarrow{\;T_2(\mathbf{y})\;} \mathbf{w} \xrightarrow{\;T_3(\mathbf{w})\;} \mathbf{z} \tag{7} \]

By applying (6) to the first stage, we obtain

\[ \hat{p}_x(\mathbf{x}|H_1) = \frac{p_x(\mathbf{x}|H_0)}{p_y(\mathbf{y}|H_0)}\; \hat{p}_y(\mathbf{y}|H_1). \tag{8} \]

If this process is continued recursively to expand the last
term each time, we obtain

\[ \hat{p}_x(\mathbf{x}|H_1) = \frac{p_x(\mathbf{x}|H_0)}{p_y(\mathbf{y}|H_0)}\; \frac{p_y(\mathbf{y}|H_0')}{p_w(\mathbf{w}|H_0')}\; \frac{p_w(\mathbf{w}|H_0'')}{p_z(\mathbf{z}|H_0'')}\; \hat{p}_z(\mathbf{z}|H_1), \tag{9} \]

where $H_0$, $H_0'$, $H_0''$ are reference hypotheses suited to each
stage in the processing chain.

There is a special embedded relationship between these
hypotheses. Let $\mathcal{H}_y$ be the region of sufficiency (ROS) for
$\mathbf{y}$, the set of all hypotheses for which $\mathbf{y}$ is a sufficient
statistic.¹ Let $\mathcal{H}_w$ and $\mathcal{H}_z$ be similarly defined for $\mathbf{w}$ and $\mathbf{z}$.
Because information is lost at each stage in the chain, the
region of sufficiency shrinks each time. Thus, we have
$\mathcal{H}_z \subset \mathcal{H}_w \subset \mathcal{H}_y$. For optimality of the classifier, we also
require $H_1 \in \mathcal{H}_z$, $H_0'' \in \mathcal{H}_z$, $H_0' \in \mathcal{H}_w$, $H_0 \in \mathcal{H}_y$.
The factorization (9) together with the embedding of the
hypotheses we call the chain-rule processor (CRP).

¹ We mean that for a binary test between any pair of hypotheses in the
set, the feature is a sufficient statistic.


3 Building a Classifier
3.1 Extending to M Classes

We now consider the M-ary classification problem. We
adapt equations (1), (2) for multiple classes as follows. Let

\[ \hat{p}(\mathbf{x}|H_j) = Q(\mathbf{x}, T_j, H_{0,j})\; \hat{p}(\mathbf{z}_j|H_j), \tag{10} \]

where $\mathbf{z}_j = T_j(\mathbf{x})$ is a class-specific feature set for class
j and $\hat{p}(\mathbf{z}_j|H_j)$ is its PDF under class $H_j$. Notice that the
“Q”-function uses a class-dependent reference hypothesis;
there is no need to make it common. As we have stated,
$\hat{p}(\mathbf{x}|H_j)$ so defined is always a PDF (it integrates to 1 on
$\mathbf{x}$) and it is a member of $\mathcal{F}(T_j, f_z)$. The result is the
class-specific Neyman-Pearson classifier

\[ j^* = \arg\max_j\; Q_j(\mathbf{x}, T_j, H_{0,j})\; \hat{p}(\mathbf{z}_j|H_j). \tag{11} \]

Notice that the Q-function is a function of $\mathbf{x}$ that depends
only on the transformation $T_j(\mathbf{x})$ and the reference
hypothesis $H_{0,j}$, so it may be determined a priori without
training. $Q_j(\mathbf{x}, T_j, H_{0,j})$ is the correct way to compensate the
likelihoods in a class-specific classifier. The ability of the
Q-function to provide a “peak” at the “correct” feature set
gives the classifier a measure of classification performance
without needing to train. In some applications, the PDF of
the features is not even needed: the Q-function does all the
work.
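The class-specific decision rule (11) can be sketched for a toy two-class problem as follows. All specifics are illustrative assumptions rather than the paper's design: class 0 is zero-mean Gaussian noise with a hypothetical variance of 4 and uses the total energy as its feature; class 1 is a hypothetical DC level of 1 in unit-variance noise and uses the sample sum as its feature; both arms share the same unit-variance WGN reference hypothesis for simplicity, although the method allows a different reference per class. The feature PDFs are written analytically here, whereas in practice they would be estimated from training data.

```python
import math

def log_gauss(x, mean, var):
    """Log of the univariate Gaussian density N(x; mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_chi2(u, dof):
    """Log of the chi-square density with `dof` degrees of freedom (u > 0)."""
    return ((0.5 * dof - 1.0) * math.log(u) - 0.5 * u
            - 0.5 * dof * math.log(2.0) - math.lgamma(0.5 * dof))

def log_px_h0(x):
    """Log-PDF of the raw data under the WGN reference hypothesis H0."""
    return sum(log_gauss(xi, 0.0, 1.0) for xi in x)

def energy_arm(x, var=4.0):
    """Class 0 arm: feature z = sum(x^2); returns log Q + log p(z|class 0)."""
    n = len(x)
    z = sum(xi * xi for xi in x)
    log_q = log_px_h0(x) - log_chi2(z, n)          # z ~ chi2(n) under H0
    log_pz = log_chi2(z / var, n) - math.log(var)  # z/var ~ chi2(n) under class 0
    return log_q + log_pz

def mean_arm(x, level=1.0):
    """Class 1 arm: feature z = sum(x); returns log Q + log p(z|class 1)."""
    n = len(x)
    z = sum(x)
    log_q = log_px_h0(x) - log_gauss(z, 0.0, n)    # z ~ N(0, n) under H0
    log_pz = log_gauss(z, n * level, n)            # z ~ N(n*level, n) under class 1
    return log_q + log_pz

def classify(x):
    """Return (index of the winning class, list of projected log-likelihoods)."""
    scores = [energy_arm(x), mean_arm(x)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

if __name__ == "__main__":
    print(classify([1.2, 0.8, 1.1, 0.9]))    # data resembling a DC level
    print(classify([3.0, -2.5, 2.8, -3.1]))  # data resembling high-variance noise
```

Because each feature in this toy case is a sufficient statistic for its class against the reference, each arm reproduces the exact raw-data log-likelihood of its class, even though the two arms never share a common feature space.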


1051-4651/02  $17.00  (c)  2002  IEEE 


3.2  Class-Specific  Modules 

The derivation of Q-functions for a particular feature
transformation is something that needs to be done only
once. Once complete, the Q-function calculation can be
bundled together with the feature calculation to arrive at a
software package called a class-specific “module”.
Capitalizing on the chain-rule, the designer of a class-specific
classifier can draw from a library of modules which can be
connected together into “chains” to form each arm of a
classifier. All probabilities are represented in the log-domain,
so the log-Q functions are added together. At the end of the
chain, the aggregate log-Q function is added to the log-PDF
of the features. Q-functions have been derived for many
features useful in speech analysis, time-series analysis, and
general classification. As time goes on, more feature sets
become available.
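The module idea described above can be sketched as functions that each return their output together with a log-Q term, chained by simple addition. This is a minimal sketch under stated assumptions, not the author's software: the module names, the two toy stages (pairwise energies, then their sum), the unit-variance WGN reference, and the hypothetical class variance of 3 are all chosen purely for illustration.

```python
import math

def log_chi2(u, dof):
    """Log of the chi-square density with `dof` degrees of freedom (u > 0)."""
    return ((0.5 * dof - 1.0) * math.log(u) - 0.5 * u
            - 0.5 * dof * math.log(2.0) - math.lgamma(0.5 * dof))

def pair_energy_module(x):
    """Stage 1 (hypothetical): y_k = x_{2k}^2 + x_{2k+1}^2.

    Reference: unit-variance WGN, under which each y_k is chi-square(2).
    Returns (y, log Q) with log Q = log p(x|H0) - log p(y|H0).
    """
    y = [x[i] ** 2 + x[i + 1] ** 2 for i in range(0, len(x), 2)]
    log_px = sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * xi * xi for xi in x)
    log_py = sum(log_chi2(yk, 2) for yk in y)
    return y, log_px - log_py

def total_energy_module(y):
    """Stage 2 (hypothetical): z = sum(y).

    Reference: independent chi-square(2) inputs, so z is chi-square(2*len(y)).
    Returns (z, log Q) with log Q = log p(y|H0') - log p(z|H0').
    """
    z = sum(y)
    log_py = sum(log_chi2(yk, 2) for yk in y)
    log_pz = log_chi2(z, 2 * len(y))
    return z, log_py - log_pz

def run_chain(x, modules, log_feature_pdf):
    """Add the log-Q terms along the chain, then add the feature log-PDF."""
    data, log_q_total = x, 0.0
    for module in modules:
        data, log_q = module(data)
        log_q_total += log_q
    return log_q_total + log_feature_pdf(data)

def make_energy_log_pdf(n, var):
    """Feature log-PDF for zero-mean Gaussian data of variance `var`."""
    return lambda z: log_chi2(z / var, n) - math.log(var)

if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 0.3]
    arm = [pair_energy_module, total_energy_module]
    print(run_chain(x, arm, make_energy_log_pdf(len(x), 3.0)))
```

Keeping each module's log-Q next to its feature calculation means an arm of the classifier is just a list of modules, and replacing one stage does not require re-deriving the others.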

4  Example 


The class-specific method is not an algorithm like a
neural network that can be applied to an existing set of features
and compared with dozens of existing algorithms.
Comparison with traditional feature-based classifiers on open
databases is problematic for several reasons. First, it is a
method that requires defining its own features and requires
access to the raw data. Second, it is highly dependent on the
choices that are made in design including features and
reference hypotheses. The closest thing to a “fair” comparison
would be to develop class-specific features, then collect all
the features from all the classes into one set and offer this to
the conventional classifier. Such an experiment is available
[4] and shows that the conventional classifier requires two
orders of magnitude (more than 100 times) more training
data to reach the same maximum performance level.

To illustrate the design of chain-rule processors, we
develop the Q-function for the autocorrelation function
(ACF). Let

\[ \mathbf{z} = [\hat{r}_0, \hat{r}_1 \ldots \hat{r}_P]', \]

where P is the order of the desired autoregressive (AR)
process and the circular autocorrelation samples are

\[ \hat{r}_t = \sum_{i=1}^{N} x_i\, x_{[i+t]}, \]

where $[i+t]$ means $(i+t)$ modulo-N. Note that $\mathbf{z}$ is
related by a 1:1 transformation to linear prediction (LPC)
coefficients or reflection coefficients (RC) [5],[3], and is
extremely useful for time-series analysis and spectral
analysis. It is easily shown that an alternative way to compute $\hat{r}_t$
is as follows:

\[ \hat{r}_t = \frac{1}{N} \sum_{k=0}^{N/2} \epsilon_k\, y_k \cos(2\pi k t/N), \]

where $\epsilon_k = 2$ for $k = 1, 2 \ldots N/2 - 1$ and 1 otherwise, and
where $y_k$ are the DFT magnitude-square bins

\[ y_k = \left| \sum_{n=1}^{N} x_n \exp(-i 2\pi n k/N) \right|^2. \tag{12} \]

This can be written as the matrix equation

\[ \mathbf{z} = C'\, \mathbf{y}, \tag{13} \]

where the definitions of $\mathbf{y}$ and $C$ are obvious. Thus, we
can break the ACF calculation into two stages: DFT (12),
followed by linear transformation (13).


4.1 Stage 1

In the first stage, we compute the DFT (12). For our
reference hypothesis for this stage, we use $H_0$, the standard
normal density (WGN hypothesis with unit variance). We
have

\[ p_x(\mathbf{x}|H_0) = (2\pi)^{-N/2} \exp\left\{ -\frac{1}{2} \sum_{i=1}^{N} x_i^2 \right\}. \tag{14} \]

Note that under $H_0$, $\mathbf{y}$ is a set of independent RVs. It is
easily shown that $y_0, y_{N/2}$ obey the $\chi^2(1)$ density with mean
$N$ and variance $2N^2$. Also, $y_1 \ldots y_{N/2-1}$ obey the
exponential density with mean $N$ and variance $N^2$. Thus,

\[ p(\mathbf{y}|H_0) = \prod_{i=0}^{N/2} p(y_i|H_0), \tag{15} \]

where

\[ p(y_i|H_0) = \frac{1}{N\sqrt{2\pi}} \left( \frac{y_i}{N} \right)^{-1/2} \exp\left\{ -\frac{y_i}{2N} \right\}, \quad i = 0,\, N/2, \tag{16} \]

and

\[ p(y_i|H_0) = \frac{1}{N} \exp\left\{ -\frac{y_i}{N} \right\}, \quad 1 \le i \le N/2 - 1. \tag{17} \]
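Stage 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the author's implementation: N is assumed even, the circular autocorrelation is taken as an unnormalized sum of lagged products (so that the DFT form carries a factor 1/N), the DFT index runs from 0 to N-1 (which leaves the magnitude-squared bins unchanged), and the function names are hypothetical.

```python
import cmath
import math

def dft_mag_sq(x):
    """DFT magnitude-squared bins y_k for k = 0..N/2 (N assumed even)."""
    n = len(x)
    y = []
    for k in range(n // 2 + 1):
        s = sum(x[m] * cmath.exp(-2j * math.pi * m * k / n) for m in range(n))
        y.append(abs(s) ** 2)
    return y

def circular_acf(x, order):
    """Circular autocorrelation r_t = sum_i x_i * x_[(i+t) mod N], t = 0..order."""
    n = len(x)
    return [sum(x[i] * x[(i + t) % n] for i in range(n))
            for t in range(order + 1)]

def acf_from_dft(y, n, order):
    """Same ACF computed from the bins: (1/N) * sum_k eps_k * y_k * cos(2*pi*k*t/N)."""
    r = []
    for t in range(order + 1):
        acc = 0.0
        for k, yk in enumerate(y):
            eps = 1.0 if k in (0, n // 2) else 2.0
            acc += eps * yk * math.cos(2.0 * math.pi * k * t / n)
        r.append(acc / n)
    return r

def log_q_stage1(x):
    """log Q for the DFT stage: log p(x|H0) - log p(y|H0), H0 = unit-variance WGN."""
    n = len(x)
    y = dft_mag_sq(x)
    log_px = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * sum(v * v for v in x)
    log_py = 0.0
    for k, yk in enumerate(y):
        if k in (0, n // 2):
            # Scaled chi-square(1) density with mean N (end bins).
            log_py += (-math.log(n * math.sqrt(2.0 * math.pi))
                       - 0.5 * math.log(yk / n) - yk / (2.0 * n))
        else:
            # Exponential density with mean N (interior bins).
            log_py += -math.log(n) - yk / n
    return log_px - log_py

if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.4, -0.7]
    y = dft_mag_sq(x)
    print(circular_acf(x, 3))
    print(acf_from_dft(y, len(x), 3))
    print(log_q_stage1(x))
```

By Parseval's relation the exponential terms in the numerator and denominator cancel, so under these assumptions the stage-1 log-Q depends on the data only through the two end bins.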


4.2  Stage  2 


In stage 2, we compute (13) and use a variable reference
hypothesis $H_0'$. We let $H_0'$ be the hypothesis that $\mathbf{y}$ obeys
the AR spectrum corresponding to $\mathbf{z}$. Thus, we must use
the Levinson algorithm to solve for the P-th order AR
coefficients $\sigma_0^2$, $\mathbf{a}^z$. Let $A_z(k)$ be the DFT of $\mathbf{a}^z$ (padded to
length $N$), then

\[ P_z(k) = \frac{\sigma_0^2}{|A_z(k)|^2} \]

is the AR spectrum corresponding to $\mathbf{a}^z$, $\sigma_0^2$. We let

\[ \mathbf{y}^z = [P_z(0) \ldots P_z(N/2)]'. \tag{18} \]

We need to evaluate $p_y(\mathbf{y}|H_0')$ and $p_z(\mathbf{z}|H_0')$. We
assume that under $H_0'$, $\{y_i\}$ are a set of independent
exponential and $\chi^2$ RVs with means corresponding to $\{y_i^z\}$, the
elements of $\mathbf{y}^z$. Specifically,

\[ p_y(y_i|H_0') = \frac{1}{y_i^z \sqrt{2\pi}} \left( \frac{y_i}{y_i^z} \right)^{-1/2} \exp\left\{ -\frac{y_i}{2 y_i^z} \right\}, \tag{19} \]

for $i = 0,\, N/2$, and

\[ p_y(y_i|H_0') = \frac{1}{y_i^z} \exp\left\{ -\frac{y_i}{y_i^z} \right\}, \quad 1 \le i \le N/2 - 1. \tag{20} \]

Because $H_0'$ is “close” to $\mathbf{z}$, we approximate $p_z(\mathbf{z}|H_0')$
by the central limit theorem (CLT). Under $H_0'$, the elements
of $\mathbf{y}$ are independent with mean $\mathbf{y}^z$ and diagonal covariance

\[ \Sigma_y^z(i,i) = \begin{cases} 2\,(y_i^z)^2, & i = 0,\, N/2 \\ (y_i^z)^2, & 1 \le i \le N/2 - 1. \end{cases} \]

We can then easily compute the mean and covariance of $\mathbf{z}$:

\[ \mathbf{z}_0^z = E(\mathbf{z}|H_0') = C'\, \mathbf{y}^z, \]

and

\[ \Sigma_z^z = C'\, \Sigma_y^z\, C. \]

Then

\[ p_z(\mathbf{z}|H_0') = (2\pi)^{-\frac{P+1}{2}}\, |\det(\Sigma_z^z)|^{-1/2} \exp\left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0^z)' (\Sigma_z^z)^{-1} (\mathbf{z} - \mathbf{z}_0^z) \right\} \simeq (2\pi)^{-\frac{P+1}{2}}\, |\det(\Sigma_z^z)|^{-1/2}, \tag{21} \]

where in the last step, we make the approximation $\mathbf{z}_0^z \simeq \mathbf{z}$.
This approximation becomes better as $N$ becomes larger.
Note also that the method just described is closely related to
the ML approach. In fact, $\Sigma_z^z$ can be related to the Fisher
information of the ACF estimates [5].
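The first part of stage 2, going from ACF samples to the reference spectrum and then to the reference log-density of the DFT bins, can be sketched as follows. This is a minimal sketch under stated assumptions: a textbook Levinson-Durbin recursion with the sign convention A(k) = 1 + sum_m a_m exp(-i 2 pi m k / N), hypothetical function names, and no attempt at the Gaussian approximation (21), which would additionally need the matrix C and a determinant.

```python
import cmath
import math

def levinson(r, order):
    """Levinson-Durbin recursion: ACF r[0..order] -> (a, sigma2), with a[0] = 1."""
    a = [1.0]
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err                      # m-th reflection coefficient
        a = [1.0] + [a[i] + k * a[m - i] for i in range(1, m)] + [k]
        err *= (1.0 - k * k)                # updated prediction error power
    return a, err

def ar_spectrum(a, sigma2, n):
    """AR spectrum sigma2 / |A(k)|^2 for k = 0..n/2 (a zero-padded to length n)."""
    spec = []
    for k in range(n // 2 + 1):
        big_a = sum(a[m] * cmath.exp(-2j * math.pi * m * k / n)
                    for m in range(len(a)))
        spec.append(sigma2 / abs(big_a) ** 2)
    return spec

def log_py_reference(y, yz):
    """Sum of the log-densities (19) and (20): bins y with reference means yz."""
    last = len(y) - 1                       # index of the N/2 bin
    total = 0.0
    for k, (yk, mk) in enumerate(zip(y, yz)):
        if k in (0, last):
            total += (-math.log(mk * math.sqrt(2.0 * math.pi))
                      - 0.5 * math.log(yk / mk) - yk / (2.0 * mk))
        else:
            total += -math.log(mk) - yk / mk
    return total

if __name__ == "__main__":
    rho = 0.5
    r = [rho ** t for t in range(3)]        # ACF shape of a first-order AR process
    a, sigma2 = levinson(r, 2)
    print(a, sigma2)
    print(ar_spectrum(a, sigma2, 8))
```

With the unnormalized ACF of the previous stage as input, the prediction error power returned by the recursion scales with N, which is what lets the resulting spectrum serve directly as the mean of the magnitude-squared bins.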

4.3  Validation 

It is always important to validate the Q-function before
implementation of the classifier. For this purpose, we have
developed a method for end-to-end testing of a chain-rule
processor [6]. However, because the current example has
been analyzed through a different approach, it is possible
to compare the results. Direct implementation of the
Q-function without breaking into two stages is possible if we
can compute the PDF $p(\mathbf{z}|H_0)$. This density has been
published [3]. We compared the two methods in a simulation.
We created data using an AR(4) model. AR coefficients
were randomly created in each trial by choosing the
reflection coefficients uniformly in the range [-1,1], then
transforming into AR coefficients. Although there is insufficient
space for a graph, the log-Q functions of the two methods
track very closely across a wide range. The maximum
difference did not exceed 0.5 in the log space although the
values were spread across a range of more than 300, a factor of
$e^{300}$. This wide range is due to the fact that samples are in
the far tails of $p(\mathbf{z}|H_0)$.
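The data generation step of this simulation can be sketched as below. The step-up recursion from reflection coefficients to AR coefficients is standard; the unit-variance Gaussian driving noise, the burn-in length, the seed, and the function names are assumptions made for illustration, and the sketch does not reproduce the comparison of the two Q-function implementations.

```python
import random

def reflection_to_ar(refl):
    """Step-up recursion: reflection coefficients -> AR polynomial [1, a_1, ..., a_P]."""
    a = [1.0]
    for m, k in enumerate(refl, start=1):
        a = [1.0] + [a[i] + k * a[m - i] for i in range(1, m)] + [k]
    return a

def random_ar_coefficients(order, rng):
    """Draw reflection coefficients uniformly in [-1, 1] and convert them."""
    refl = [rng.uniform(-1.0, 1.0) for _ in range(order)]
    return reflection_to_ar(refl)

def simulate_ar(a, n, rng, burn_in=200):
    """Generate n samples of x_t = e_t - sum_m a_m * x_{t-m}, e_t ~ N(0, 1)."""
    order = len(a) - 1
    hist = [0.0] * order                    # hist[m-1] holds x_{t-m}
    out = []
    for t in range(n + burn_in):
        e = rng.gauss(0.0, 1.0)
        x = e - sum(a[m] * hist[m - 1] for m in range(1, order + 1))
        hist = [x] + hist[:-1]
        if t >= burn_in:                    # discard the start-up transient
            out.append(x)
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    a = random_ar_coefficients(4, rng)      # one AR(4) model per trial
    x = simulate_ar(a, 64, rng)
    print(a)
    print(x[:5])
```

Reflection coefficients of magnitude below 1 always yield a stable AR model, which is presumably why the coefficients are drawn in that domain rather than directly.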

5  Conclusions 

The PDF projection theorem makes it possible to
apply the classical theory of hypothesis testing in problems
where PDFs must be approximated from training data. It
allows the PDF approximation to be carried out in a
class-specific low-dimensional feature space, while the likelihood
comparisons can be made in the common raw data space.
Doing this avoids the curse of dimensionality if the
class-dependent features are each of low dimension. The theorem
requires that the statistics of the input and output data of the
transformation be known for a particular class-dependent
reference hypothesis. The chain-rule is the recursive
application of the PDF projection theorem. The resulting
architecture, called the chain-rule processor, facilitates the
analysis of complex chains of transformations. The ability to
change the reference hypothesis “on the fly” further
facilitates the analysis. The analysis of a chain which computes
the autocorrelation function was demonstrated. More
examples and documentation may be found on the website [6].